{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# pandas 进阶修炼 ｜早起Python\n",
    "<br>\n",
    "\n",
    "**本习题由公众号【早起Python & 可视化图鉴】 原创，转载及其他形式合作请与我们联系（微信号`sshs321`)，未经授权严禁搬运及二次创作，侵权必究！**\n",
    "\n",
    "\n",
    "\n",
    "本习题基于 `pandas` 版本 `1.1.3`，所有内容应当在 `Jupyter Notebook` 中执行以获得最佳效果。\n",
    "\n",
    "\n",
    "不同版本之间写法可能会有少许不同，如若碰到此情况，你应该学会如何自行检索解决。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9 - 更多未提及的操作\n",
    "\n",
    "\n",
    "\n",
    "<br>\n",
    "\n",
    "\n",
    "**<font color = '#5172F0'><font size=3.5>必读👇👇👇</font>**\n",
    "    \n",
    "在前面 8 章中，我们已经将 pandas 数据分析中最常见的部分大致练习完毕\n",
    "    \n",
    "但是在整理习题的过程中\n",
    "\n",
    "有些很重要或者很实用的操作，很难找到一个合适的章节进行解释\n",
    "    \n",
    "也有些操作，在整理时有所遗漏\n",
    "    \n",
    "因此本章习题就是为了介绍更多重要、实用但未在前面提到的操作。\n",
    "    \n",
    "注意！本章非固定，未来会不断的进行补充！\n",
    "\n",
    "关注公众号「早起Python」第一时间获得最新的版本！\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 初始化\n",
    "\n",
    "<br>\n",
    "\n",
    "该 `Notebook` 版本为**纯习题版**\n",
    "\n",
    "如果需要答案或者提示，可以微信搜索公众号「早起Python」获取！"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9-1 `map` 与 `applymap`\n",
    "\n",
    "<br>\n",
    "\n",
    "`pandas` 中的 `map` 和 `applymap` 可以对指定列（map）或整个数据框（applymap）工作\n",
    "\n",
    "完成替换、格式化、计算等操作，是 `Pandas` 数据分析中十分重要的工具。\n",
    "\n",
    "为了方便理解，首先执行下方代码创建并查看数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 270,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "df1 = pd.DataFrame({'A': ['A0', 'A1', np.nan, 'A3'],\n",
    "                    'B': ['B0',np.nan,'B3',np.nan],\n",
    "                    'C': ['C0','C1','C2',np.nan],\n",
    "                    'D': np.random.randn(4),\n",
    "                    'E': np.random.randn(4),\n",
    "                   'F': np.random.randn(4)},\n",
    "                   index=[0, 1, 2, 3])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1 - map｜基本使用\n",
    "\n",
    "将 `df1` 第一列中的 `A0` 替换为 `cat`，`A3` 替换为 `rabbit`，其余为设置为`NaN`（缺失值）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2 - map｜匿名函数\n",
    "\n",
    "在上一题的结果上，将 df1 第 1 列中的字符末尾追加「今天关注了早起Python」"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3 - map｜跳过缺失值\n",
    "\n",
    "上一题中，nan（缺失值）也被同步追加了字符串\n",
    "\n",
    "现在重新对第二列执行同样的操作，并跳过缺失值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4 - map｜自定义函数\n",
    "\n",
    "除了 lambda ，map还可以接受自定义函数，现在对第三列，使用自定义函数完成上一题的任务"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5 - applymap｜lambda\n",
    "\n",
    "`applymap`可以对整个 `dataframe` 工作，现在将 df1 的最后三列保留两位小数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9-2  `stack` 与 `unstack` "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6 - stack｜数据堆叠\n",
    "\n",
    "<br>\n",
    "\n",
    "stack字面意思是数据堆叠，但是理解起来就是将数据由宽变长\n",
    "\n",
    "怎样做到？\n",
    "\n",
    "通过**将部分列名拿下来当作索引**来实现，例如下图所示\n",
    "\n",
    "本来应是`2列4行`，但通过 `stack` 可以将列A拿下来当作索引，从而变成`1列8行`\n",
    "\n",
    "![](https://pandas.pydata.org/docs/_images/reshaping_stack.png)\n",
    "\n",
    "为了复现上面的例子，首先需要执行下方代码来生成数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 138,
   "metadata": {},
   "outputs": [],
   "source": [
    "tuples = list(\n",
    "    zip(\n",
    "        *[\n",
    "            [\"bar\", \"bar\", \"baz\", \"baz\", \"foo\", \"foo\", \"qux\", \"qux\"],\n",
    "            [\"one\", \"two\", \"one\", \"two\", \"one\", \"two\", \"one\", \"two\"],\n",
    "        ]\n",
    "    )\n",
    ")\n",
    "index = pd.MultiIndex.from_tuples(tuples, names=[\"first\", \"second\"])\n",
    "df = pd.DataFrame(np.arange(1,17).reshape([8,2]), index=index, columns=[\"A\", \"B\"])\n",
    "df2 = df[:4]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在，按上图所示，对 df2 进行堆叠"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 140,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7 - unstack｜逆堆叠\n",
    "\n",
    "对上一题的结果进行还原，即逆堆叠，过程如下图所示\n",
    "\n",
    "![](https://pandas.pydata.org/docs/_images/reshaping_unstack.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 8 - unstack｜层级\n",
    "\n",
    "在使用 unstack 进行逆堆叠时，可以指定层级，例如指定按照 second 进行，也就是如下图所示\n",
    "\n",
    "![](https://pandas.pydata.org/docs/_images/reshaping_unstack_1.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9-3 `isin` 筛选\n",
    "\n",
    "在 `pandas` 中有没有类似 `SQL` 中 `IN` 和 `NOTIN` 的筛选方法？\n",
    "\n",
    "`isin`就可以实现，通过 isin 可以快速筛选出包含某个值的结果\n",
    "\n",
    "为了方便练习，首先需要执行下面的代码生成示例数据，并应简单查看一下"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 271,
   "metadata": {},
   "outputs": [],
   "source": [
    "df3 = pd.DataFrame({'country': ['China','US', 'UK', 'Germany', 'Japan'],\n",
    "             'rank':[1,2,3,4,5]})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 9 - isin｜根据列表筛选\n",
    "\n",
    "筛选出 `country` 包含 `'China','UK'` 的行"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 10 - isin｜逆筛选\n",
    "\n",
    "对上一题的结果取逆"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df3[~df3.country.isin(['China','UK'])]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9-4  `select_dtypes` 筛选\n",
    "\n",
    "<br>\n",
    "\n",
    "`select_dtypes`  可以筛选制定数据类型的列\n",
    "\n",
    "为了方便练习，首先需要执行下面的代码生成示例数据，并应简单查看一下"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 157,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 11 - select_dtypes｜单类型\n",
    "\n",
    "筛选 df4 数据类型为整数的列"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 12 - select_dtypes｜多类型\n",
    "\n",
    "筛选 df4 数据类型为和浮点数的列"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 13 - select_dtypes｜排除\n",
    "\n",
    "筛选 df4 数据类型为布尔值的列"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  9-5 `explode` 数据展开\n",
    "\n",
    "<br>\n",
    "\n",
    "有时我们的数据中会包含列表，此时便可使用  `explode` 进行展开，将一个list拆成多行\n",
    "\n",
    "为了方便练习，首先需要执行下面的代码生成示例数据，并应简单查看一下"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 156,
   "metadata": {},
   "outputs": [],
   "source": [
    "df5 = pd.DataFrame({'A': [[0, 1, 2], 'foo', [], [3, 4]],\n",
    "                   'B': 1,\n",
    "                   'C': [['a', 'b', 'c'], np.nan, [], ['d', 'e']]})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 14 - explode｜单列\n",
    "\n",
    "将 df5 第 A 列进行展开"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 15 - explode｜多列\n",
    "\n",
    "将 df5 第 A、C 列进行展开"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9-6 `nunique` 统计\n",
    "\n",
    "<br>\n",
    "\n",
    "`nunique` 可以统计指定轴上不唯一的元素数量\n",
    "\n",
    "[👉对应官方文档](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nunique.html)\n",
    "\n",
    "为了方便练习，首先需要执行下面的代码生成示例数据，并应简单查看一下"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 272,
   "metadata": {},
   "outputs": [],
   "source": [
    "df6 = pd.DataFrame({'A': [4, 5, 6], 'B': [4, 1, 1]})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 16 - nunique｜按列"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 17 - nunique｜按行"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "微信搜索公众号「早起Python」，关注后可以获得更多资源！"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9-7  `cumsum` 计算\n",
    "\n",
    "cumsum 可以对数据按照指定方式进行累加\n",
    "\n",
    "[👉官方文档](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.cumsum.html)\n",
    "\n",
    "为了方便练习，首先需要执行下面的代码生成示例数据，并应简单查看一下"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 181,
   "metadata": {},
   "outputs": [],
   "source": [
    "df7 = pd.DataFrame(np.arange(1,37).reshape([9,4]), columns=[\"A\", \"B\",\"C\",\"D\"])\n",
    "df7['item'] = ['Apple','Xiaomi','Huawei'] * 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 18 - cumsum｜按列\n",
    "\n",
    "将 df7 按列进行累加"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 19 - cumsum｜按行\n",
    "\n",
    "将 df7 按行进行累加"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 20 - cumsum｜按组\n",
    "\n",
    "将 df7 按照 `item` 按不同组对第 A 列进行累加"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9-8 `append`｜添加\n",
    "\n",
    "在很多教程，包括 [pandas 官方文档](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#appending-rows-to-a-dataframe)中，都将 append 结合 merge、concat、join 一起讲解\n",
    "\n",
    "但是对我来说，虽然append得到的结果也类似合并，可它常常出现的地方就是它的字面意思 -> 添加（追加）\n",
    "\n",
    "所以我将在这里介绍 append\n",
    "\n",
    "下面是几个 append 的常用操作，为了方便练习，首先需要执行下面的代码生成示例数据，并应简单查看一下\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 257,
   "metadata": {},
   "outputs": [],
   "source": [
    "df8 = pd.DataFrame(\n",
    "    {\n",
    "        \"A\": [\"A0\", \"A1\", \"A2\", \"A3\"],\n",
    "        \"B\": [\"B0\", \"B1\", \"B2\", \"B3\"],\n",
    "        \"C\": [\"C0\", \"C1\", \"C2\", \"C3\"],\n",
    "        \"D\": [\"D0\", \"D1\", \"D2\", \"D3\"],\n",
    "    },\n",
    "    index=[0, 1, 2, 3],\n",
    ")\n",
    "\n",
    "s2 = pd.Series([\"X0\", \"X1\", \"X2\", \"X3\"], index=[\"A\", \"B\", \"C\", \"D\"])\n",
    "s3 = pd.DataFrame({\"A\": ['s1'],\"B\": ['s2'],\"C\": ['s3'],\"D\": ['s4']})\n",
    "dicts = [{\"A\": 1, \"B\": 2, \"C\": 3, \"X\": 4}, {\"A\": 5, \"B\": 6, \"C\": 7, \"Y\": 8}]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 21 - append｜末尾追加\n",
    "\n",
    "将 s2 添加至 df8 的末尾\n",
    "\n",
    "![](https://pandas.pydata.org/pandas-docs/stable/_images/merging_append_series_as_row.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 22 - append｜指定位置追加\n",
    "\n",
    "将 s3 添加至 df8 的第三行"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 23 - append｜添加字典\n",
    "\n",
    "将下面的字典 dicts 插入添加至 df8，并保留索引，如下图所示\n",
    "\n",
    "![](https://pandas.pydata.org/pandas-docs/stable/_images/merging_append_dits.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9-9 `compare` 比较\n",
    "\n",
    "<br>\n",
    "\n",
    "`compare` 用于比较两个数据框之间的差异\n",
    "\n",
    "[👉官方文档](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html)\n",
    "\n",
    "\n",
    "为了方便练习，首先需要执行下面的代码生成示例数据，并应简单查看一下"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 269,
   "metadata": {},
   "outputs": [],
   "source": [
    "df9 = pd.DataFrame(\n",
    "    {\n",
    "        \"col1\": [\"a\", \"a\", \"b\", \"b\", \"a\"],\n",
    "        \"col2\": [1.0, 2.0, 3.0, np.nan, 5.0],\n",
    "        \"col3\": [1.0, 2.0, 3.0, 4.0, 5.0]\n",
    "    },\n",
    "    columns=[\"col1\", \"col2\", \"col3\"],\n",
    ")\n",
    "\n",
    "\n",
    "df10 = df9.copy()\n",
    "df10.loc[0, 'col1'] = 'c'\n",
    "df10.loc[2, 'col3'] = 4.0\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 24 - compare｜常规\n",
    "\n",
    "输出 df9 和 df10 的差异"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 25 - compare｜保留数据框\n",
    "\n",
    "在上一题的要求下，保留原数据框"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 26 - compare｜保留值\n",
    "\n",
    "在上一题的基础上，再保留原始相同的值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](http://liuzaoqi.oss-cn-beijing.aliyuncs.com/2021/09/16/16317972442543.jpg?域名/sample.jpg?x-oss-process=style/stylename)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
